The goal of this project is to develop a machine learning model that identifies amphibian and insect species from images and provides information about their characteristics and potential danger to humans. Reliable identification of this kind can support conservation efforts and public safety. The first step is to gather a large dataset of images and species information; the model is then trained on that data. Ethical considerations are important throughout, and the project depends on obtaining a dataset comprehensive enough for analysis and model training.
I will begin by importing the necessary libraries and loading the dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
The dataset used for this project is sourced from https://observation.org/, a platform where users publish their observations of various species together with images. Using this data gives the project a diverse range of species observations and accompanying images for training the machine learning model. A large, community-driven dataset of this kind should help the model identify amphibian and insect species more accurately.
import pandas as pd
# Load the two datasets from the Data directory
file_path1 = 'Data/observationsInsectSpider.csv'
file_path2 = 'Data/observationsSnakesAmphibia.csv'
df1 = pd.read_csv(file_path1)
df2 = pd.read_csv(file_path2)
# concatenate the two datasets
observation_df = pd.concat([df1, df2], ignore_index=True)
# sort the resulting dataframe by the "id" column
observation_df = observation_df.sort_values(by="id")
observation_df
| | id | observed_on_string | observed_on | time_observed_at | time_zone | user_id | user_login | user_name | created_at | updated_at | ... | geoprivacy | taxon_geoprivacy | coordinates_obscured | positioning_method | positioning_device | species_guess | scientific_name | common_name | iconic_taxon_name | taxon_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 199440 | 39 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | Taricha torosa | California Newt | Amphibia | 27818 |
| 199441 | 40 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | Taricha torosa | California Newt | Amphibia | 27818 |
| 199442 | 80 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | Callisaurus draconoides | Zebra-tailed Lizard | Reptilia | 36080 |
| 0 | 203 | March 18, 2008 12:00 | 2008-03-18 | 2008-03-18 19:00:00 UTC | Pacific Time (US & Canada) | 49.0 | alan99 | NaN | 2008-05-05 08:32:40 UTC | 2023-03-06 04:56:46 UTC | ... | NaN | NaN | False | NaN | NaN | Ranchman's Tiger Moth | Arctia virginalis | Ranchman's Tiger Moth | Insecta | 626880 |
| 1 | 523 | 2008-07-13 | 2008-07-13 | NaN | Eastern Time (US & Canada) | 1.0 | kueda | Ken-ichi Ueda | 2008-07-26 00:22:46 UTC | 2022-12-25 18:17:48 UTC | ... | NaN | NaN | False | NaN | NaN | Forest Tent Caterpillar Moth | Malacosoma disstria | Forest Tent Caterpillar Moth | Insecta | 81663 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 303676 | 156384671 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | Duttaphrynus melanostictus | Asian Common Toad | Amphibia | 62345 |
| 303677 | 156387550 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | Bufo bufo | Gewone Pad | Amphibia | 326296 |
| 303678 | 156391245 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | Crotalus adamanteus | Eastern Diamondback Rattlesnake | Reptilia | 53491 |
| 303679 | 156392005 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | Hyla arborea | Boomkikker | Amphibia | 424147 |
| 303680 | 156400272 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | Zamenis scalaris | Trapslang | Reptilia | 540328 |
303681 rows × 39 columns
observation_df.isna().sum()
id                                0
observed_on_string                104528
observed_on                       104532
time_observed_at                  115994
time_zone                         104242
user_id                           104241
user_login                        104241
user_name                         155945
created_at                        104241
updated_at                        104241
quality_grade                     104241
license                           150394
url                               104241
image_url                         1032
sound_url                         303265
tag_list                          288257
description                       233379
num_identification_agreements     104241
num_identification_disagreements  104241
captive_cultivated                104241
oauth_application_id              219374
place_guess                       104269
latitude                          104241
longitude                         104241
positional_accuracy               143310
private_place_guess               303681
private_latitude                  303681
private_longitude                 303681
public_positional_accuracy        143252
geoprivacy                        303681
taxon_geoprivacy                  280312
coordinates_obscured              104241
positioning_method                273362
positioning_device                271583
species_guess                     106951
scientific_name                   1
common_name                       27771
iconic_taxon_name                 0
taxon_id                          0
dtype: int64
# Check how many unique species are in the dataset
observation_df['scientific_name'].nunique()
21196
There are 21,196 distinct species in this dataset. Since the dataset has 303,681 rows, many species appear in more than one observation.
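To see how often species repeat, the observations per species can be counted with `value_counts`. The sketch below uses a small made-up DataFrame (`toy_df` is hypothetical and only stands in for `observation_df`); the same calls would apply to the real data.

```python
import pandas as pd

# Hypothetical toy data standing in for observation_df
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "scientific_name": ["Bufo bufo", "Bufo bufo", "Hyla arborea",
                        "Taricha torosa", "Bufo bufo"],
})

# Number of observations per species, most frequent first
obs_per_species = toy_df["scientific_name"].value_counts()

# Species that occur in more than one observation
repeated_species = obs_per_species[obs_per_species > 1]

n_rows = len(toy_df)
n_species = toy_df["scientific_name"].nunique()
print(f"{n_rows} rows, {n_species} species, "
      f"{n_rows / n_species:.2f} observations per species on average")
```

Repeated species are expected here: each row is one observation, and several images per species are useful when training an image model.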
observation_df['iconic_taxon_name'].unique()
array(['Amphibia', 'Reptilia', 'Insecta', 'Arachnida'], dtype=object)
The dataset has four categories: Amphibia, Reptilia, Insecta, and Arachnida.
In this step, I will eliminate the unnecessary columns for this application. Then, I will examine the dataset for null values and remove any duplicated species within the dataset.
observation_df.head()
| | id | observed_on_string | observed_on | time_observed_at | time_zone | user_id | user_login | user_name | created_at | updated_at | ... | geoprivacy | taxon_geoprivacy | coordinates_obscured | positioning_method | positioning_device | species_guess | scientific_name | common_name | iconic_taxon_name | taxon_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 199440 | 39 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | Taricha torosa | California Newt | Amphibia | 27818 |
| 199441 | 40 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | Taricha torosa | California Newt | Amphibia | 27818 |
| 199442 | 80 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | Callisaurus draconoides | Zebra-tailed Lizard | Reptilia | 36080 |
| 0 | 203 | March 18, 2008 12:00 | 2008-03-18 | 2008-03-18 19:00:00 UTC | Pacific Time (US & Canada) | 49.0 | alan99 | NaN | 2008-05-05 08:32:40 UTC | 2023-03-06 04:56:46 UTC | ... | NaN | NaN | False | NaN | NaN | Ranchman's Tiger Moth | Arctia virginalis | Ranchman's Tiger Moth | Insecta | 626880 |
| 1 | 523 | 2008-07-13 | 2008-07-13 | NaN | Eastern Time (US & Canada) | 1.0 | kueda | Ken-ichi Ueda | 2008-07-26 00:22:46 UTC | 2022-12-25 18:17:48 UTC | ... | NaN | NaN | False | NaN | NaN | Forest Tent Caterpillar Moth | Malacosoma disstria | Forest Tent Caterpillar Moth | Insecta | 81663 |
5 rows × 39 columns
The columns I want to keep in this dataset are id, image_url, scientific_name, common_name, and iconic_taxon_name.
The other columns can be removed.
columns_to_drop = [
'observed_on_string', 'observed_on', 'time_observed_at', 'time_zone',
'user_id', 'user_login', 'user_name', 'created_at', 'updated_at',
'quality_grade', 'license', 'url', 'sound_url', 'tag_list',
'description', 'num_identification_agreements',
'num_identification_disagreements', 'captive_cultivated', 'oauth_application_id',
'place_guess', 'latitude', 'longitude', 'positional_accuracy',
'public_positional_accuracy', 'geoprivacy', 'taxon_geoprivacy',
'coordinates_obscured', 'positioning_method', 'positioning_device',
'species_guess', 'private_place_guess', 'private_latitude', 'private_longitude', 'taxon_id'
]
data_filtered = observation_df.drop(columns=columns_to_drop)
data_filtered
| | id | image_url | scientific_name | common_name | iconic_taxon_name |
|---|---|---|---|---|---|
| 199440 | 39 | https://inaturalist-open-data.s3.amazonaws.com... | Taricha torosa | California Newt | Amphibia |
| 199441 | 40 | https://inaturalist-open-data.s3.amazonaws.com... | Taricha torosa | California Newt | Amphibia |
| 199442 | 80 | https://inaturalist-open-data.s3.amazonaws.com... | Callisaurus draconoides | Zebra-tailed Lizard | Reptilia |
| 0 | 203 | http://static.inaturalist.org/photos/132/mediu... | Arctia virginalis | Ranchman's Tiger Moth | Insecta |
| 1 | 523 | https://inaturalist-open-data.s3.amazonaws.com... | Malacosoma disstria | Forest Tent Caterpillar Moth | Insecta |
| ... | ... | ... | ... | ... | ... |
| 303676 | 156384671 | https://static.inaturalist.org/photos/27041880... | Duttaphrynus melanostictus | Asian Common Toad | Amphibia |
| 303677 | 156387550 | https://inaturalist-open-data.s3.amazonaws.com... | Bufo bufo | Gewone Pad | Amphibia |
| 303678 | 156391245 | https://static.inaturalist.org/photos/27043003... | Crotalus adamanteus | Eastern Diamondback Rattlesnake | Reptilia |
| 303679 | 156392005 | https://inaturalist-open-data.s3.amazonaws.com... | Hyla arborea | Boomkikker | Amphibia |
| 303680 | 156400272 | https://inaturalist-open-data.s3.amazonaws.com... | Zamenis scalaris | Trapslang | Reptilia |
303681 rows × 5 columns
Given that the dataset is considerably smaller now, I intend to review it once more to identify any missing or duplicate values.
data_filtered.isna().sum()
id                   0
image_url            1032
scientific_name      1
common_name          27771
iconic_taxon_name    0
dtype: int64
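The check above covers missing values only. A duplicate check could look like the sketch below. It assumes that a repeated `id` marks a duplicated observation, which is an assumption made here; the source does not define what counts as a duplicate. `toy_df` is made-up data, not the project dataset.

```python
import pandas as pd

# Hypothetical toy data; the id 40 appears twice on purpose
toy_df = pd.DataFrame({
    "id": [39, 40, 40, 80],
    "scientific_name": ["Taricha torosa", "Taricha torosa",
                        "Taricha torosa", "Callisaurus draconoides"],
})

# Assumption: a repeated id means the same observation was loaded twice
duplicate_mask = toy_df.duplicated(subset="id", keep="first")
n_duplicates = int(duplicate_mask.sum())
print(f"Duplicated observation ids: {n_duplicates}")

# Keep the first occurrence of each id and drop the rest
deduplicated = toy_df.drop_duplicates(subset="id", keep="first")
```

Note that the check is keyed on `id` rather than `scientific_name`: several observations of the same species are distinct rows and would not be treated as duplicates.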
One scientific_name is missing in this dataset. Let's look at this row to check whether it can be removed.
missing_scientific_name = data_filtered[data_filtered['scientific_name'].isna()]
print(missing_scientific_name)
id image_url \
197083 152245435 https://static.inaturalist.org/photos/26286780...
scientific_name common_name iconic_taxon_name
197083 NaN NaN Insecta
This row has neither a scientific name nor a common name, so it provides no usable species label and can be removed.
# remove the empty row
data_filtered = data_filtered.dropna(subset=['scientific_name'])
data_filtered.isna().sum()
id                   0
image_url            1032
scientific_name      0
common_name          27770
iconic_taxon_name    0
dtype: int64
# Check how many unique species are in the dataset
data_filtered['scientific_name'].nunique()
21196
In the last part of this EDA, I'm going to visualize the data.
import plotly.graph_objs as go
import plotly.io as pio
# Enable notebook renderer
pio.renderers.default = 'notebook'
# Counting how many unique species per category
insecta_unique_count = data_filtered[data_filtered['iconic_taxon_name'] == 'Insecta']['scientific_name'].nunique()
arachnida_unique_count = data_filtered[data_filtered['iconic_taxon_name'] == 'Arachnida']['scientific_name'].nunique()
amphibia_unique_count = data_filtered[data_filtered['iconic_taxon_name'] == 'Amphibia']['scientific_name'].nunique()
reptilia_unique_count = data_filtered[data_filtered['iconic_taxon_name'] == 'Reptilia']['scientific_name'].nunique()
# Create a list of tuples with taxon groups and their unique species counts
taxon_counts = [
('Insecta', insecta_unique_count),
('Arachnida', arachnida_unique_count),
('Amphibia', amphibia_unique_count),
('Reptilia', reptilia_unique_count)
]
# Sort the list in descending order based on unique species count
taxon_counts.sort(key=lambda x: x[1], reverse=True)
# Separate the sorted taxon groups and species counts into separate lists
sorted_taxon_groups, sorted_species_counts = zip(*taxon_counts)
# Create a bar plot
fig = go.Figure()
fig.add_trace(go.Bar(
x=list(sorted_taxon_groups),
y=list(sorted_species_counts),
text=list(sorted_species_counts),
textposition='auto',
marker_color=['#377eb8', '#e41a1c', '#4daf4a', '#984ea3'] # Set colorblind-friendly colors for the bars
))
# Customize the plot layout
fig.update_layout(
title='Unique Insecta, Arachnida, Amphibia, and Reptilia Species',
xaxis_title='Taxon Group',
yaxis_title='Unique Species Count'
)
# Display the interactive plot
fig.show()
print(f"Number of unique Amphibia species: {amphibia_unique_count}")
print(f"Number of unique Reptilia species: {reptilia_unique_count}")
print(f"Number of unique Insecta species: {insecta_unique_count}")
print(f"Number of unique Arachnida species: {arachnida_unique_count}")
Number of unique Amphibia species: 1861
Number of unique Reptilia species: 2364
Number of unique Insecta species: 15444
Number of unique Arachnida species: 1527
N = 10
top_species = data_filtered['scientific_name'].value_counts().head(N)
sns.barplot(x=top_species.index, y=top_species.values)
plt.xticks(rotation=55)
plt.ylabel('Number of Observations')
plt.title(f'Top {N} Most Common Species')
plt.show()
# Create a copy of the filtered DataFrame
data_filtered_copy = data_filtered.copy()
# Create a new column 'has_common_name' in the copied DataFrame
data_filtered_copy['has_common_name'] = ~data_filtered_copy['common_name'].isna()
# Create the count plot
sns.countplot(x='has_common_name', data=data_filtered_copy)
plt.title('Observations with and without Common Name')
plt.show()
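The count plot above shows absolute counts. The share of observations that have a common name can also be computed directly, overall and per taxon group. The following is a minimal sketch on made-up data (`toy_df` is hypothetical); on the real data the same expressions would use `data_filtered`.

```python
import pandas as pd

# Hypothetical toy data standing in for data_filtered
toy_df = pd.DataFrame({
    "scientific_name": ["Bufo bufo", "Hyla arborea",
                        "Arctia virginalis", "Zamenis scalaris"],
    "common_name": ["Gewone Pad", None, "Ranchman's Tiger Moth", None],
    "iconic_taxon_name": ["Amphibia", "Amphibia", "Insecta", "Reptilia"],
})

# True where a common name is present
has_common_name = toy_df["common_name"].notna()

# The mean of a boolean Series is the fraction of True values
share_with_common_name = has_common_name.mean()

# Same fraction, computed separately for each taxon group
share_by_taxon = has_common_name.groupby(toy_df["iconic_taxon_name"]).mean()

print(f"Share with common name: {share_with_common_name:.0%}")
print(share_by_taxon)
```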
import plotly.graph_objs as go
def get_top_species(data, taxon, N=10):
data_taxon = data[data['iconic_taxon_name'] == taxon]
top_species = data_taxon['scientific_name'].value_counts().head(N)
return top_species
top_insect_species = get_top_species(data_filtered, 'Insecta')
top_arachnida_species = get_top_species(data_filtered, 'Arachnida')
top_amphibia_species = get_top_species(data_filtered, 'Amphibia')
top_reptilia_species = get_top_species(data_filtered, 'Reptilia')
fig = go.Figure()
fig.add_trace(go.Bar(x=top_insect_species.index,
y=top_insect_species.values,
name='Insecta',
marker_color='rgb(58, 200, 225)'))
fig.add_trace(go.Bar(x=top_arachnida_species.index,
y=top_arachnida_species.values,
name='Arachnida',
marker_color='rgb(58, 71, 80)'))
fig.add_trace(go.Bar(x=top_amphibia_species.index,
y=top_amphibia_species.values,
name='Amphibia',
marker_color='rgb(204, 204, 0)'))
fig.add_trace(go.Bar(x=top_reptilia_species.index,
y=top_reptilia_species.values,
name='Reptilia',
marker_color='rgb(229, 121, 36)'))
fig.update_layout(
title='Top 10 Most Common Species for Insecta, Arachnida, Amphibia, and Reptilia',
xaxis=dict(title='Species'),
yaxis=dict(title='Number of Observations'),
legend=dict(x=0, y=1.0),
barmode='group',
bargap=0.15,
bargroupgap=0.1,
plot_bgcolor='white',
xaxis_tickangle=-45
)
fig.show()
data_filtered
| | id | image_url | scientific_name | common_name | iconic_taxon_name |
|---|---|---|---|---|---|
| 199440 | 39 | https://inaturalist-open-data.s3.amazonaws.com... | Taricha torosa | California Newt | Amphibia |
| 199441 | 40 | https://inaturalist-open-data.s3.amazonaws.com... | Taricha torosa | California Newt | Amphibia |
| 199442 | 80 | https://inaturalist-open-data.s3.amazonaws.com... | Callisaurus draconoides | Zebra-tailed Lizard | Reptilia |
| 0 | 203 | http://static.inaturalist.org/photos/132/mediu... | Arctia virginalis | Ranchman's Tiger Moth | Insecta |
| 1 | 523 | https://inaturalist-open-data.s3.amazonaws.com... | Malacosoma disstria | Forest Tent Caterpillar Moth | Insecta |
| ... | ... | ... | ... | ... | ... |
| 303676 | 156384671 | https://static.inaturalist.org/photos/27041880... | Duttaphrynus melanostictus | Asian Common Toad | Amphibia |
| 303677 | 156387550 | https://inaturalist-open-data.s3.amazonaws.com... | Bufo bufo | Gewone Pad | Amphibia |
| 303678 | 156391245 | https://static.inaturalist.org/photos/27043003... | Crotalus adamanteus | Eastern Diamondback Rattlesnake | Reptilia |
| 303679 | 156392005 | https://inaturalist-open-data.s3.amazonaws.com... | Hyla arborea | Boomkikker | Amphibia |
| 303680 | 156400272 | https://inaturalist-open-data.s3.amazonaws.com... | Zamenis scalaris | Trapslang | Reptilia |
303680 rows × 5 columns
In this analysis, we performed an Exploratory Data Analysis (EDA) on a dataset containing observations of Insecta, Arachnida, Amphibia, and Reptilia. We aimed to explore the data and understand the distribution of species and the availability of common names in the dataset.
First, we reduced the dataset to the columns relevant for this application. The data covers four taxon groups: insects, arachnids, amphibians, and reptiles. We checked for missing values and removed the one row with a missing scientific name. We visualized the distribution of observations across the top 10 most observed species and counted the observations with and without an associated common name.
Next, we created a series of visualizations to better understand the data: a bar plot of the number of unique species per taxon group, a bar plot of the top 10 most observed species, and a count plot of observations with and without a common name.
These visualizations helped us gain insights into the taxonomic diversity of the dataset and identify potential imbalances or patterns that could affect a machine learning model's performance.
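One way to put a number on such imbalance is to compare group sizes directly. The sketch below reuses the unique-species counts printed earlier in this notebook. The share per group and the largest-to-smallest ratio are illustrative metrics chosen here, not something the original analysis computed. These are species counts; per-group observation counts would come from `data_filtered['iconic_taxon_name'].value_counts()` instead.

```python
import pandas as pd

# Unique species per taxon group, taken from the printed counts above
species_per_taxon = pd.Series(
    {"Insecta": 15444, "Reptilia": 2364, "Amphibia": 1861, "Arachnida": 1527},
    name="unique_species",
)

total_species = int(species_per_taxon.sum())

# Fraction of all species that falls into each group
share_per_taxon = species_per_taxon / total_species

# Illustrative imbalance metric: largest group divided by smallest group
imbalance_ratio = species_per_taxon.max() / species_per_taxon.min()

print(share_per_taxon.round(3))
print(f"Largest-to-smallest ratio: {imbalance_ratio:.1f}")
```

A ratio far above 1 suggests that per-class metrics, class weighting, or resampling may be worth considering during model training.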
Finally, we created an interactive plot that displays the top 10 most common species for Insecta, Arachnida, Amphibia, and Reptilia. The interactive plot allows for easy exploration and comparison of the most common species in all four taxonomic groups.
In conclusion, the EDA we performed provided valuable insights into the dataset's structure, distribution, and potential issues. These insights can inform the data preparation process and guide the development of a machine learning model to identify different species of insects, arachnids, amphibians, and reptiles, as well as provide information about their characteristics and potential danger level to humans.
# Specify the file path for the new CSV file
new_file_path = 'Data/observation.csv'
# Save the updated DataFrame to the new CSV file
data_filtered.to_csv(new_file_path, index=False)